The scale free structure $p(k) \sim k^{-\gamma}$ of protein-protein interaction networks can be reproduced by a static physical model in simulation. We inspect the model theoretically and find the key reason why it generates apparent scale free degree distributions. This explanation provides a generic mechanism for "scale free" networks. Moreover, we predict the dependence of $\gamma$ on experimental protein concentrations or other sensitivity factors in detecting interactions, and find experimental evidence to support the prediction.



1. Introduction

"Scale free" networks have been observed in many areas of science [1], including social science, biology and the internet, where degree distributions follow (albeit with noise) the power law form $p(k) \sim k^{-\gamma}$ within one or two orders of magnitude of $k$. Here the degree $k$ is the number of links a node has, and $p(k)$ is the probability that a node has degree $k$. An important scale free network under experimental [j , H, H, HI and theoretical [E @, 0, 0, & M, HI El study is the protein-protein interaction (PPI) network, where a link between two proteins indicates a large enough binding energy between them. These studies bear the hope that the topology of PPI networks reflects how systems of various proteins have evolved in biological organisms.

It was pointed out recently that scale free PPI networks could also result from variation in the surface hydrophobicities of proteins. Starting from an approximately Gaussian distribution of surface hydrophobicity, the static model successfully produced scale free networks in computer simulations [6].

Why can this static model generate scale free networks? As a counterpart of the simulation results in Ref. [6], in this paper we study the model from a theoretical perspective, and reveal the key reason that the model leads to "scale free" networks. More importantly, our numerical and analytical study reveals the dependence of the power $\gamma$ on experimental sensitivity factors, such as protein concentration, in the detection of PPI, and provides a possible explanation for the observed variation of $\gamma$ in different high-throughput PPI experiments.

2. The static model 

Let us first briefly introduce the model proposed by Deeds et al. [6]. For the compositions of surface residues of yeast proteins in high-throughput experiments 0, 0], the fractions of hydrophobic residues, denoted $p$, follow a Gaussian distribution

$$f(p) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(p - \bar{p})^2}{2\sigma^2}\right], \qquad (1)$$

with mean value $\bar{p} \approx 0.2$ and standard deviation $\sigma \approx 0.05$. This results in an approximately Gaussian distribution of the surface "stickiness" $K$, and the binding free energy of two proteins is determined by the sum of their "stickiness". In a more detailed description, there are $K_i$ hydrophobic residues among the $M$ surface residues on protein $i$, and $M = 100$ is assumed to be a constant for all proteins. The probability to find a protein with $K$ hydrophobic surface residues is

$$p_E(K) = \int dp\, f(p) \binom{M}{K} p^K (1 - p)^{M - K}. \qquad (2)$$

It can be seen that $p_E(K)$ is close to a Gaussian distribution (Fig. 1a).
FIG. 1: (color online). a) Hydrophobicity distribution $p_E(K)$ in Eq. (2) for $N = 5000$. The $K$ region in red is the same as in the inset. b) The dependence of the expected degree $\bar{k}$ upon hydrophobicity $K$, for $K_c = 83$. The range $1 < k < 100$ is in red.
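
A minimal numerical sketch of Eq. (2) may help to see how close $p_E(K)$ is to a Gaussian. It assumes that the Gaussian $f(p)$ is truncated to $0 \le p \le 1$ and renormalised on a uniform grid, and that the binomial coefficient is included so that $p_E(K)$ sums to one. The function name `hydrophobicity_distribution` and the grid size are illustrative choices, not taken from the original work.

```python
import numpy as np
from math import comb


def hydrophobicity_distribution(M=100, p_mean=0.2, sigma=0.05, n_grid=4001):
    """Return p_E(K) for K = 0..M (Eq. 2) as a Gaussian-weighted binomial mixture.

    Assumption: f(p) is truncated to [0, 1] and renormalised on the grid.
    """
    p = np.linspace(0.0, 1.0, n_grid)
    f = np.exp(-(p - p_mean) ** 2 / (2.0 * sigma ** 2))
    f /= f.sum()  # discrete weights of f(p) dp, normalised to one
    K = np.arange(M + 1)
    binom = np.array([comb(M, int(k)) for k in K], dtype=float)
    # pmf[K, j] = C(M, K) * p_j^K * (1 - p_j)^(M - K)
    pmf = binom[:, None] * p[None, :] ** K[:, None] * (1.0 - p[None, :]) ** (M - K[:, None])
    return pmf @ f


pE = hydrophobicity_distribution()
K_mean = float((np.arange(pE.size) * pE).sum())       # close to M * p_mean
K_std = float(np.sqrt((np.arange(pE.size) ** 2 * pE).sum() - K_mean ** 2))
print(f"mean K = {K_mean:.2f}, std K = {K_std:.2f}, most probable K = {pE.argmax()}")
```

The spread of $K$ is broader than that of a single binomial because the variation of $p$ between proteins adds to the sampling noise of the $M$ surface residues.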

The binding of proteins $i$ and $j$ is determined by the binding free energy

$$\Delta G = -(K_i + K_j) F_0 + G^{(0)}, \qquad (3)$$

where $\Delta G$ is negative for strong binding, $F_0$ is the change of binding free energy upon burial of each hydrophobic residue, and $G^{(0)} = 6\,$kcal/mol $\approx 10\,k_B T$ is a constant value determined by experiments [3, [la ]. In support of this model, Fig. 3 of Ref. [14] showed that experimental binding energies can be described by the sum of stickiness terms and a constant term. If $K_i + K_j \ge K_c$ the interaction is experimentally detectable, and the two proteins are labeled as linked in the PPI network.
3. Results and interpretations

3.1. Degree distributions 

We calculate $p(k)$ numerically (see Numerical methods for details) for given values of $N$ and $K_c$, where $N$ is the total number of proteins in the network, and obtain an apparent "scale free" structure $p(k) \propto k^{-\gamma}$ (Fig. 2). We set the default situation as $N = 5000$ and $K_c = 83$ to fit $\gamma = 2$. Fig. 2 indicates that the apparent slope $\gamma$ increases with $K_c$, and increases as $N$ decreases. More explicitly, the dependence of $\gamma$ upon $K_c$ is plotted in Fig. 3.
FIG. 2: "Power law" degree distribution $p(k)$ for different situations, with a solid line indicating slope $\gamma = 2$. Circles (default): $N = 5000$ and $K_c = 83$; dots: $N = 5000$ and $K_c = 75$; squares: $N = 1000$ and $K_c = 75$. Only data with $p(k) > 1/N$ are shown.
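
The numerical calculation of $p(k)$ can be sketched as follows, using the exact sum and binomial expressions given in the Numerical methods section (Eqs. (14) and (15)). This is a hedged illustration rather than the code behind Fig. 2: the helper names, the grid used for $f(p)$, the clipping of the link probability, and the simple least-squares fit of the apparent slope over $1 \le k \le 100$ are all assumptions, so the fitted value need not coincide with the quoted $\gamma$.

```python
import numpy as np
from math import comb


def hydrophobicity_distribution(M=100, p_mean=0.2, sigma=0.05, n_grid=4001):
    """p_E(K) for K = 0..M; Gaussian f(p) truncated to [0, 1] (assumption)."""
    p = np.linspace(0.0, 1.0, n_grid)
    f = np.exp(-(p - p_mean) ** 2 / (2.0 * sigma ** 2))
    f /= f.sum()
    K = np.arange(M + 1)
    binom = np.array([comb(M, int(k)) for k in K], dtype=float)
    pmf = binom[:, None] * p[None, :] ** K[:, None] * (1.0 - p[None, :]) ** (M - K[:, None])
    return pmf @ f


def degree_distribution(N=5000, K_c=83, M=100):
    """p(k) for k = 0..N: binomial degree fluctuations averaged over K."""
    pE = hydrophobicity_distribution(M)
    # tail[j] = sum of p_E(K') over K' >= j, with one extra zero for j = M + 1
    tail = np.append(np.cumsum(pE[::-1])[::-1], 0.0)
    K = np.arange(M + 1)
    # link probability q = k_bar / N for a protein of hydrophobicity K
    q = tail[np.clip(K_c - K, 0, M + 1)]
    q = np.clip(q, 1e-300, 1.0 - 1e-12)  # keep the logarithms finite
    k = np.arange(N + 1)
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, N + 1)))))
    log_binom = log_fact[N] - log_fact[k] - log_fact[N - k]
    p_k = np.zeros(N + 1)
    for K_val in range(M + 1):
        log_pmf = log_binom + k * np.log(q[K_val]) + (N - k) * np.log1p(-q[K_val])
        p_k += pE[K_val] * np.exp(log_pmf)
    return p_k


def apparent_gamma(p_k, k_min=1, k_max=100):
    """Least-squares slope of ln p(k) against ln k, returned as a positive power."""
    k = np.arange(k_min, k_max + 1)
    slope, _ = np.polyfit(np.log(k), np.log(p_k[k]), 1)
    return -slope


p_k = degree_distribution()                 # default situation N = 5000, K_c = 83
print("apparent gamma:", apparent_gamma(p_k))
```

Calling `degree_distribution(N=5000, K_c=75)` or `degree_distribution(N=1000, K_c=75)` gives the other two situations of Fig. 2 under the same assumptions.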

Let us interpret these results by analytical approaches. A protein with hydrophobicity $K$ has a pass/fail line

$$K_{\rm line} = K_c - K. \qquad (4)$$

Proteins with hydrophobicity above $K_{\rm line}$ are linked to it, while those with hydrophobicity below $K_{\rm line}$ are not. Therefore the protein with hydrophobicity $K$ has an average degree

$$\bar{k} = N \int_{K_{\rm line}}^{M} p_E(K')\, dK'. \qquad (5)$$



FIG. 3: Dependence of the power $\gamma$ upon experimental sensitivity in detecting interactions. $\gamma$ increases with $K_c$, and $K_c$ is replaced on the top by $C_{\rm crit}/(C_i C_j)$ from Eq. (12). The error bar at $K_c < 78$ comes mostly from undulations. The slight offset of $p(k = 1)$ produces a bigger error bar at $K_c > 88$, where there are fewer $k$ data points. The solid line is the approximation Eq. (9).


In the mean field approximation the degree $k$ of the protein is just $\bar{k}$, and the degree distribution is

$$p(k) = p_E(K) \frac{dK}{d\bar{k}} = \frac{p_E(K)}{N p_E(K_c - K)}. \qquad (6)$$

Beyond the mean field approximation its degree fluctuates with deviation $\sim \sqrt{\bar{k}}$, which will be addressed later.

Let us restrict the discussion to the mean field approximation for the moment. We notice that the experimentally observable range $1 < k < 100$ only covers a small range of hydrophobicity ($39 < K < 48$ for the default situation), as indicated by the short red line in Fig. 1b. In this range the hydrophobicity distribution $p_E(K)$ is very close to exponential, since the short red line in Fig. 1a is nearly straight. So we can use a linear approximation of $\ln p_E(K)$ to produce the nearly straight lines in Fig. 2. Define



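$$a = -\left.\frac{d \ln p_E(K')}{dK'}\right|_{K' = K_{\rm line}} \qquad (7)$$

and

$$b = -\left.\frac{d \ln p_E(K')}{dK'}\right|_{K' = K}; \qquad (8)$$

then Eq. (5) gives $\bar{k} = e^{aK + {\rm const}}$, while Eq. (6) leads to $p(k) = e^{-(a+b)K + {\rm const}}$. As a result we have $p(k) \sim k^{-(1 + b/a)}$. This is a "scale free" network with $\gamma = 1 + b/a$.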
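
The relation $\gamma = 1 + b/a$ can be checked numerically by estimating the two local slopes of $\ln p_E$ with central differences. The sketch below is illustrative only: the choice $K = 41$ as a representative hydrophobicity for $K_c = 83$, the finite-difference estimate, and the helper names are assumptions. For a purely exponential distribution the two slopes coincide and the sketch returns $\gamma = 2$.

```python
import numpy as np
from math import comb


def hydrophobicity_distribution(M=100, p_mean=0.2, sigma=0.05, n_grid=4001):
    """p_E(K) for K = 0..M; Gaussian f(p) truncated to [0, 1] (assumption)."""
    p = np.linspace(0.0, 1.0, n_grid)
    f = np.exp(-(p - p_mean) ** 2 / (2.0 * sigma ** 2))
    f /= f.sum()
    K = np.arange(M + 1)
    binom = np.array([comb(M, int(k)) for k in K], dtype=float)
    pmf = binom[:, None] * p[None, :] ** K[:, None] * (1.0 - p[None, :]) ** (M - K[:, None])
    return pmf @ f


def local_slope(pE, K):
    """Central-difference estimate of -d ln p_E / dK at integer K."""
    return -(np.log(pE[K + 1]) - np.log(pE[K - 1])) / 2.0


def gamma_from_slopes(pE, K, K_c):
    """gamma = 1 + b/a, with a taken at K_line = K_c - K and b taken at K."""
    a = local_slope(pE, K_c - K)
    b = local_slope(pE, K)
    return 1.0 + b / a


pE = hydrophobicity_distribution()
print("gamma for K = 41, K_c = 83:", gamma_from_slopes(pE, 41, 83))

# Purely exponential "stickiness": a = b, hence gamma = 2 for any K and K_c.
pE_exp = np.exp(-0.5 * np.arange(101))
print("gamma for exponential p_E:", gamma_from_slopes(pE_exp, 30, 70))
```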

To understand the undulations in $p(k)$ at large $k$ in Fig. 2 we must go beyond the mean field approximation and deal with the fluctuation of the degree, with magnitude $\sqrt{\bar{k}}$ for a given $\bar{k}$. Since the $K$ values are discrete integers, each $K$ value produces a peak in $p(k)$, centered at $\bar{k}$ and with width $\sqrt{\bar{k}}$. Since $\bar{k}$ grows with $K$ almost exponentially, the distance between nearest neighbor peaks $\bar{k}(K+1) - \bar{k}(K)$ grows linearly with $\bar{k}$. The undulations emerge at large enough $\bar{k}$, when the peak distance exceeds the peak width $\sqrt{\bar{k}}$.

Now we are ready to study the dependence of the slope $\gamma$ on the parameter $K_c$ in Fig. 3. Approximating the hydrophobicity distribution as a Gaussian distribution, $\ln p_E(K) \sim -(K - K_0)^2$, where $K_0$ is the most probable hydrophobicity value, we have

$$\gamma = 1 + \frac{b}{a} = 1 + \frac{K_c - K_{\rm line} - K_0}{K_{\rm line} - K_0}. \qquad (9)$$

We find $K_0 \approx 20$ in Eq. (2), and $K_{\rm line} \approx 41.5$ is nearly a constant from Eq. (5) for typical degree $k \sim 5$; then $\gamma$ is a linear function of $K_c$ in Eq. (9), and forms a straight line (solid) in Fig. 3.
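
As a small worked example, the Gaussian approximation can be evaluated directly, reading Eq. (9) as $\gamma = 1 + (K_c - K_{\rm line} - K_0)/(K_{\rm line} - K_0)$ with $K = K_c - K_{\rm line}$. The function name is illustrative, and the default values $K_{\rm line} = 41.5$ and $K_0 = 20$ are the ones quoted in the text.

```python
def gamma_gaussian(K_c, K_line=41.5, K_0=20.0):
    """Gaussian-approximation slope: gamma = 1 + b/a with a ~ K_line - K_0, b ~ K - K_0."""
    K = K_c - K_line  # hydrophobicity of a protein with the typical degree
    return 1.0 + (K - K_0) / (K_line - K_0)


for K_c in (75, 83, 87):
    print(K_c, gamma_gaussian(K_c))
```

With these values the default threshold $K_c = 83$ gives exactly $\gamma = 2$, and $\gamma$ grows linearly with $K_c$ with slope $1/(K_{\rm line} - K_0)$.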

3.2. Dependence on experimental sensitivity 
Different $\gamma$ values have been obtained in different PPI experiments, varying from $\gamma \approx 2.1$ to $\gamma \approx 2.5$ [E EL H H, S EH-. To explain this variation, we notice that different experiments might have different sensitivity in detecting PPI. Indeed, some interactions detectable in one experiment might be too weak to be detected in another experiment. An example of a factor affecting experimental sensitivity is protein concentration/level, which is in turn controlled by gene expression and depends upon the specific technique used to detect PPI. Even for the same experiment, the sensitivity in detecting interactions is effectively reduced by setting a higher standard in identifying PPI, e.g., selecting only highly repeatable PPI data, which effectively correspond to interactions with high affinity.

Let us study how $\gamma$ depends on these experimental sensitivity factors. In high-throughput experiments the concentration of the protein-protein complex $C_{ij}$ must be high enough to be detected:

$$C_{ij} = \frac{C_i C_j}{C_0} \exp\left(-\frac{\Delta G}{k_B T}\right) > C_{\rm crit}, \qquad (10)$$

where the binding free energy $\Delta G$ is given by Eq. (3), $C_i$ and $C_j$ are the concentrations of proteins $i$ and $j$ in monomeric form, and the normalization concentration $C_0 = 1\,$M is the convention. Rewriting this relationship in the form of an association constant, the binding affinity should be strong enough to be detectable:

$$\mathcal{K}_a = \frac{1}{C_0} \exp\left[\frac{(K_i + K_j) F_0 - G^{(0)}}{k_B T}\right] > \frac{1}{C_0} \exp\left[\frac{K_c F_0 - G^{(0)}}{k_B T}\right] = \frac{C_{\rm crit}}{C_i C_j}. \qquad (11)$$

Thus the parameter $K_c$ of the model is determined by experimental protein concentrations:

$$K_c = \left[k_B T \ln\frac{C_0 C_{\rm crit}}{C_i C_j} + G^{(0)}\right] / F_0. \qquad (12)$$



To estimate the only unknown parameter $F_0$ in this equation, we notice that for the yeast two hybrid screening technique the PPIs with binding affinity $\mathcal{K}_a > C_{\rm crit}/(C_i C_j) = 1\,\mu{\rm M}^{-1}$ are detectable pjj]. If we use $\gamma \approx 2.3$ and $K_c = 87$ for this threshold binding affinity, we obtain an estimate $F_0 \approx 0.28\,k_B T$. With the help of this value we can use Eq. (12) to convert the x-axis of Fig. 3 from $K_c$ to the experimental variable $C_{\rm crit}/(C_i C_j)$ (top of Fig. 3).
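
The arithmetic behind this estimate can be sketched by solving Eq. (12) for $F_0$. The sketch assumes $C_0 = 1\,$M, $G^{(0)} = 6\,$kcal/mol, and a temperature of 298.15 K to convert kcal/mol into units of $k_B T$; the temperature and the function names are assumptions made for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def f0_in_kT(K_c, Ka_threshold, G0_kT, C0=1.0):
    """Solve Eq. (12) for F_0 / (k_B T). Ka_threshold in 1/M, C0 in M."""
    return (math.log(C0 * Ka_threshold) + G0_kT) / K_c


def threshold_affinity(K_c, F0_kT, G0_kT, C0=1.0):
    """Invert Eq. (12): threshold affinity C_crit / (C_i C_j) in 1/M for a given K_c."""
    return math.exp(K_c * F0_kT - G0_kT) / C0


T = 298.15                      # assumed temperature in K
G0_kT = 6.0 / (R_KCAL * T)      # 6 kcal/mol in units of k_B T, roughly 10
F0 = f0_in_kT(K_c=87, Ka_threshold=1.0e6, G0_kT=G0_kT)  # 1 / microM = 1e6 / M
print(f"G0 = {G0_kT:.2f} kT, F0 = {F0:.3f} kT")

# Converting the K_c axis of Fig. 3 into a threshold affinity (top axis):
for K_c in (75, 83, 87, 90):
    print(K_c, f"{threshold_affinity(K_c, F0, G0_kT):.3g} 1/M")
```

Under these assumptions $F_0$ comes out between 0.27 and 0.28 $k_B T$, consistent with the estimate quoted above.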

It can be seen from Fig. 3 that lower sensitivity, or lower $C_i C_j$, leads to higher $\gamma$. Lower sensitivity can result from lower protein concentrations through reduced gene expression, or from selecting only highly repeatable data of detected PPIs. This prediction is confirmed by Figure 2a of Ref. [10], which clearly shows that the core data set of Ito et al. [4], containing only PPIs identified by at least three independent sequence tags, generates a steeper degree distribution than the full Ito data set does. Obviously the Ito core data correspond to relatively strong interactions, manifest in high $\mathcal{K}_a$ and $K_c$. Note that the horizontal dots with $p(k) = 1/N$ at high $k$ in Figure 2a of Ref. [10] should be excluded when fitting the slope $\gamma$, because they are actually in the $p(k) < 1/N$ region where a few nodes with arbitrary degree $k$ emerge occasionally. On the other hand, the protein concentrations in the yeast two hybrid experiments [H, |j[ are not yet available, and the prediction about the dependence of the slope $\gamma$ upon protein concentration needs verification from future experiments.

3.3. Clustering coefficient 

We also study another important property of networks, the clustering coefficient $C(k)$, and show the numerical result of the model in Fig. 4. If a protein is linked to $k$ proteins, the average number of links between the $k$ proteins, $t(k)$, cannot exceed $k(k-1)/2$. Here the averaging includes all possible realizations. The clustering coefficient is $C(k) = \frac{t(k)}{k(k-1)/2} \le 1$. Similarly to Ref. [13, [3, we obtain (Fig. 4) $C(k) \sim 1$ at small $k$ and $C(k) \sim k^{-2}$ at large $k$. The experimental result 0, 0, 0, E3] has a similar shape with slope $\approx -2$ for large $k$, and $C(k)$ is smeared between 1 and $10^{-1}$ for small $k$. If we attribute the discrepancy between the model and experiment at small $k$ to false negatives, the model is in reasonable agreement with experiments.
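
As a rough illustration of why $C(k)$ stays at 1 for small $k$ and falls off for large $k$, the sketch below considers a hub whose neighbours are all proteins with hydrophobicity at or above a cutoff `K_line`, and evaluates the fraction of neighbour pairs that satisfy the linking condition $K_1 + K_2 \ge K_c$. This is a discrete mean-field estimate, not the full numerical calculation shown in Fig. 4; the helper names and the treatment of $f(p)$ on a grid are assumptions.

```python
import numpy as np
from math import comb


def hydrophobicity_distribution(M=100, p_mean=0.2, sigma=0.05, n_grid=4001):
    """p_E(K) for K = 0..M; Gaussian f(p) truncated to [0, 1] (assumption)."""
    p = np.linspace(0.0, 1.0, n_grid)
    f = np.exp(-(p - p_mean) ** 2 / (2.0 * sigma ** 2))
    f /= f.sum()
    K = np.arange(M + 1)
    binom = np.array([comb(M, int(k)) for k in K], dtype=float)
    pmf = binom[:, None] * p[None, :] ** K[:, None] * (1.0 - p[None, :]) ** (M - K[:, None])
    return pmf @ f


def clustering_mean_field(pE, K_c, K_line):
    """Fraction of neighbour pairs (K_1, K_2 >= K_line) with K_1 + K_2 >= K_c."""
    K = np.arange(K_line, len(pE))
    w = pE[K_line:]
    pair_weight = np.outer(w, w)                    # weight of each (K_1, K_2) pair
    linked = (K[:, None] + K[None, :]) >= K_c       # pairs that are themselves linked
    return float((pair_weight * linked).sum() / pair_weight.sum())


pE = hydrophobicity_distribution()
N, K_c = 5000, 83
for K_line in (44, 42, 38, 34, 30):
    k_bar = N * pE[K_line:].sum()                   # mean-field degree of the hub
    print(K_line, f"k = {k_bar:.1f}", f"C = {clustering_mean_field(pE, K_c, K_line):.3g}")
```

Once `K_line` drops below $K_c/2$ the neighbourhood grows much faster than the set of mutually linked pairs, so the ratio decreases steeply with the degree.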

A physical picture is helpful to interpret this result. As mentioned above, if $k$ proteins are linked to the same protein, their hydrophobicities exceed $K_{\rm line}$, while the hydrophobicities of all other proteins are below $K_{\rm line}$. The mean field relationship between $K_{\rm line}$ and $k$ is Eq. (5). If $K_{\rm line} > K_c/2$ at a small degree $k$, then the most hydrophobic $k$ proteins are all connected, and $C(k)$ is 1. At large enough $k$, however, $K_{\rm line} < K_c/2$ and not all proteins above $K_{\rm line}$ are linked to each other. Then the clustering coefficient is determined by

$$C(k) = \frac{\int_{K_{\rm line}}^{M} dK_1 \int_{\max\{K_c - K_1,\, K_{\rm line}\}}^{M} dK_2\; p_E(K_1)\, p_E(K_2)}{\int_{K_{\rm line}}^{M} dK_1 \int_{K_{\rm line}}^{M} dK_2\; p_E(K_1)\, p_E(K_2)}. \qquad (13)$$

The denominator is proportional to $k^2$ according to Eq. (5). It corresponds to the square region between $K_{\rm line}$ and $M$ in Fig. 5. The numerator, corresponding to the shadowed region in Fig. 5, is dominated by the region near the cutting line $K_1 + K_2 = K_c$, because $p_E(K)$ is nearly a sharp exponential function. Hence the numerator scales as the length of the cutting line, $K_c - 2K_{\rm line} \propto \ln k + {\rm const}$. Therefore, in agreement with Boguna et al. [18], the numerator is a slow function of $k$ compared to the denominator, and the clustering coefficient scales as $C(k) \sim k^{-2}$ at large $k$. At small $k$ the square is totally in the shadow, leading to $C(k) \sim 1$. The step-like shape of $C(k)$, however, comes from the discreteness of integer $K$ values.

FIG. 4: The clustering coefficient distribution $C(k)$ for the default situation. The solid line indicates slope $-2$.

FIG. 5: The diagram of Eq. (13) to find the behavior of the clustering coefficient $C(k)$. The numerator is the integral over the shadowed region, while the denominator is the integral over the square region.

4. Numerical methods

We calculate $p(k)$ as an average over all possible realizations. The calculation is done with integer $K$ and without the mean field approximation. We ignore the unimportant difference between $N$ and $N + 1$ for large enough $N$. The exact form of Eq. (5) is

$$\bar{k} = N \sum_{K' = \max\{K_c - K,\, 0\}}^{M} p_E(K'), \qquad (14)$$

and the degree distribution is

$$p(k) = \sum_{K=0}^{M} p_E(K) \binom{N}{k} \left(\frac{\bar{k}}{N}\right)^{k} \left(1 - \frac{\bar{k}}{N}\right)^{N-k}. \qquad (15)$$

Instead of the mean field result Eq. (13), the clustering coefficient is calculated as

$$C(k) = \sum_{K=0}^{M-1} w_k(K) \left\{ \left[\sum_{K_1, K_2 = K}^{M} p_E(K_1)\, p_E(K_2)\right]^{-1} \times \left[\sum_{K_1, K_2 = K}^{M} p_E(K_1)\, p_E(K_2)\, \theta(K_1 + K_2 - K_c + 1/2)\right] \right\}, \qquad (16)$$

where

$$\theta(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases} \qquad (17)$$

is the usual Heaviside step function, and

$$w_k(K) = \binom{N}{k} \left[\sum_{K'=K}^{M} p_E(K')\right]^{k} \left\{ \left[\sum_{K'=0}^{K-1} p_E(K')\right]^{N-k} - \left[\sum_{K'=0}^{K-2} p_E(K')\right]^{N-k} \right\} \qquad (18)$$

is the probability that $k$ proteins have hydrophobicity $\ge K$ while the maximum hydrophobicity of the remaining $N - k$ proteins is $K - 1$.

5. Conclusion and outlook


We study a static physical model to explain scale free PPI networks. We notice that the experimentally observable part of the degree distribution covers a limited range (from $k = 1$ to $k < 100$), and corresponds to a small range of hydrophobicity. The hydrophobicity distribution $p_E(K)$ in this small range is close enough to an exponential distribution. Therefore a linear approximation leads to the "scale free" degree distribution $p(k) \sim k^{-\gamma}$, with $\gamma$ dependent on the threshold parameter $K_c$ and network size $N$. In experiments $K_c$ depends on the sensitivity factors, such as protein concentration, in the detection of PPI. Our result provides a possible interpretation of the difference in experimental $\gamma$ values, and predicts the dependence of $\gamma$ on experimental sensitivity factors. This prediction is supported by the slope change [10] when comparing the Ito data set and the Ito core data set [4], while the dependence of $\gamma$ on protein concentrations needs experimental verification in the future. The distribution of another network property, the clustering coefficient, produced in the model is also in reasonable agreement with that of experiment [ic| and previous theoretical descriptions [Til [18].
The hydrophobicity distribution in the physical model 
has been arranged to reflect the reality in a simplified 
way. While the real distribution of protein "stickiness" 
can be somewhat different from it, the generation of 
"scale free" network will not be sensitive to the differ- 
ence. More generally, "scale free" degree distributions 
can be also produced by many smooth distributions of 
hydrophobicity, such as binomial, Gaussian, Poisson dis- 
tributions and their modifications. This can be one of 
the reasons that scale free (in a limited range) networks 
are so widely observed. 

A major part of PPI networks is obtained by high-throughput yeast two hybrid screening 0, Q|. This technique often produces a large fraction of false positives [19] which do not correspond to any real biological function, while real functional interactions presumably constitute a smaller portion of the detected result. While functional PPIs may involve formation of additional hydrogen bonds and salt bridges to obtain adequate binding affinity, these nonfunctional PPIs have not been evolutionarily selected and are formed primarily due to the hydrophobic effect. In this model we show that a simple static network of nodes with different "stickiness" can readily appear to be scale free. To this end, we use Eq. (3) because the nonfunctional PPIs are just random interfaces between two proteins that have not experienced the evolutionary design of pairwise interface patterns. Moreover, this model could be used to extract information about nonfunctional interactions between unrelated proteins which randomly encounter each other in a real cell, and such information is in turn important in probing the general principles by which cells organize their proteins. Namely, the stronger the nonfunctional interactions, the more unrelated proteins interfere with each other, and the fewer protein types can coexist. Hence the nonfunctional interactions can limit the proteome size of a single cellular organism. Interesting related questions include how much living cells have had to do to constrain nonfunctional interactions in the course of protein evolution, as well as the impact of higher temperature for thermophilic organisms.

If the distribution of "stickiness" is simply an exponential function, $\ln p_E(K) \sim -K$, the model is simplified to $a = b$ and $\gamma = 2$. This reduced simple situation would then be in complete agreement with one of the mathematical examples of networks briefly mentioned by Caldarelli et al. [20], which has been applied to realistic networks such as gene regulation networks [21]. Our finding indicates that this simple mathematical form [20] has more important impacts on systems in reality. Indeed, it is reasonable to expect the distribution of many qualities, such as annual personal income and eagerness to learn, to be fitted by an approximately exponential distribution at least in some short range, and with suitable arrangements a power law distribution might emerge.

Masuda et al. [13] followed the suggestion of Caldarelli et al. [20] and studied models essentially similar to ours. But they did not relate the mathematical models to real systems. More importantly, they emphasize that the slope $\gamma = 2$ is universal, while the slope in our study not only deviates from 2, but also depends on experimental properties such as expression levels of proteins.

In contrast to this static model, most models of PPI networks focus on the development history of the network through gene duplications [IH [12], which is similar to "preferential attachment" in growing networks [l3j. It was found [la] that the network structure of the gene duplication model analytically approaches scale free [12] at $k \to \infty$ if links of new nodes are deleted with a probability larger than 1/2, and the degree distribution is comparable with experiments. Our approach serves as an alternative way to obtain a "scale free" PPI network. Further experiments, such as a systematic study of the dependence of the apparent power $\gamma$ on gene expression level, or on other measures of protein concentration, will help clarify whether the static model or the gene duplication mechanism is mainly responsible for the observed scale free nature of PPI networks.

Glossary 

Protein-protein interaction network. A network of 
many types of proteins of an organism; each type of 
protein is a node in the network. Two nodes are labeled 
as linked if the two types of proteins can interact with 
each other with sufficient affinity. 

Degree. The number of links a node has in the 
network. If a node in the protein-protein interaction 
network has degree k, this protein can interact with k 
other types of proteins. 

Scale free network. In such a network, the number 
of nodes with degree k decreases with k, and the 
dependence is a power law function. 

Yeast two hybrid. A molecular biology technique used to discover protein-protein interactions by testing for physical interaction/binding between two proteins. This technique is able to test interactions between a large number of proteins rapidly (so-called high-throughput screening).

Sensitivity in detecting interactions. Only strong enough interactions between proteins are identified as "interacting" pairs. If the sensitivity in detection becomes higher, slightly weaker interactions become detectable, and more interactions are detected.

Surface hydrophobicity. The fraction of hydrophobic amino acids among the amino acids on the surface of a protein. If hydrophobic amino acids are buried either in the formation of a protein or in the formation of a protein-protein complex, they are no longer in contact with water, which lowers the total free energy. The hydrophobic effect is important in the interaction of proteins, especially in non-functional interactions.

Acknowledgement: This work is supported by NIH.